Análisis Coronavirus

Cristina Sierra

In [1]:
import pandas as pd
import numpy as np 
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import requests
import json
import time
import xgboost as xgb
from xgboost import XGBRegressor
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model
import datetime
from sklearn.calibration import CalibratedClassifierCV 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
In [2]:
import warnings
warnings.filterwarnings("ignore")

Datos

Fuente: Kaggle, Organización Mundial de la Salud

Para el análisis del problema se utilizaron dos fuentes de información. Kaggle, del cual se obtuvieron dos datasets, uno con información de la propagación del virus durante dos meses, y otro con información poblacional de los países. Adicionalmente, de la OMS se obtuvieron datos asociados a condiciones de salud en los países.

In [3]:
df = pd.read_csv('C:/Users/Usuario/Downloads/corona-virus-report/covid_19_clean_complete.csv',parse_dates=['Date'])
df_pop = pd.read_csv('C:/Users/Usuario/Downloads/population_by_country_2020.csv')
df_oms = pd.read_csv('C:/Users/Usuario/Downloads/oms_info.csv')
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print(f"Rows: {df_pop.shape[0]}, Columns: {df_pop.shape[1]}")
print(f"Rows: {df_oms.shape[0]}, Columns: {df_oms.shape[1]}")
Rows: 19220, Columns: 8
Rows: 235, Columns: 11
Rows: 283, Columns: 8

Limpieza de datos

In [4]:
df.head()
Out[4]:
Province/State Country/Region Lat Long Date Confirmed Deaths Recovered
0 NaN Thailand 15.0000 101.0000 2020-01-22 2.0 0.0 0.0
1 NaN Japan 36.0000 138.0000 2020-01-22 2.0 0.0 0.0
2 NaN Singapore 1.2833 103.8333 2020-01-22 0.0 0.0 0.0
3 NaN Nepal 28.1667 84.2500 2020-01-22 0.0 0.0 0.0
4 NaN Malaysia 2.5000 112.5000 2020-01-22 0.0 0.0 0.0
In [5]:
df.describe().T
Out[5]:
count mean std min 25% 50% 75% max
Lat 19220.0 25.088643 23.448518 -41.4545 12.5186 31.221 42.1657 72.000
Long 19220.0 5.289362 80.081866 -157.4983 -70.1627 9.775 57.5000 178.065
Confirmed 19219.0 287.734586 3297.651489 0.0000 0.0000 0.000 8.0000 67800.000
Deaths 19219.0 10.237994 155.703574 0.0000 0.0000 0.000 0.0000 5476.000
Recovered 19219.0 103.066341 1795.041235 0.0000 0.0000 0.000 0.0000 59879.000
In [6]:
def missing_data(data):
    total = data.isnull().sum()
    percent = (data.isnull().sum()/data.isnull().count()*100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return(np.transpose(tt))
missing_data(df)
Out[6]:
Province/State Country/Region Lat Long Date Confirmed Deaths Recovered
Total 10788 0 0 0 0 1 1 1
Percent 56.129 0 0 0 0 0.00520291 0.00520291 0.00520291
Types object object float64 float64 datetime64[ns] float64 float64 float64
In [7]:
print(f"Countries/Regions:{df['Country/Region'].nunique()}")
print(f"Province/State:{df['Province/State'].nunique()}")
Countries/Regions:183
Province/State:134

El tratamiento a datos faltantes en el caso de Province/State son imputados con la información de Country/State. Los demás datos se imputaron con el valor 0.

In [8]:
df['Province/State'] = df['Province/State'].fillna(df['Country/Region'])
df.fillna(0, inplace=True)
In [9]:
print(f"Dato más antiguo: {df['Date'].min()}")
print(f"Dato más reciente:     {df['Date'].max()}")
print(f"Total días:     {df['Date'].max() - df['Date'].min()}")
Dato más antiguo: 2020-01-22 00:00:00
Dato más reciente:     2020-03-23 00:00:00
Total días:     61 days 00:00:00
In [10]:
df_pop.head()
Out[10]:
Country (or dependency) Population (2020) Yearly Change Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban Pop % World Share
0 China 1439323776 0.39% 5540090 153 9388211 -348399.0 1.7 38 61% 18.47%
1 India 1380004385 0.99% 13586631 464 2973190 -532687.0 2.2 28 35% 17.70%
2 United States 331002651 0.59% 1937734 36 9147420 954806.0 1.8 38 83% 4.25%
3 Indonesia 273523615 1.07% 2898047 151 1811570 -98955.0 2.3 30 56% 3.51%
4 Pakistan 220892340 2.00% 4327022 287 770880 -233379.0 3.6 23 35% 2.83%
In [11]:
df_pop.describe().T
Out[11]:
count mean std min 25% 50% 75% max
Population (2020) 235.0 3.316936e+07 1.351374e+08 801.0 398876.0 5459642.0 20577053.0 1.439324e+09
Net Change 235.0 3.460878e+05 1.128260e+06 -383840.0 424.0 39170.0 249660.0 1.358663e+07
Density (P/Km²) 235.0 4.757702e+02 2.331286e+03 0.0 37.0 95.0 239.5 2.633700e+04
Land Area (Km²) 235.0 5.535918e+05 1.687796e+06 0.0 2545.0 77240.0 403820.0 1.637687e+07
Migrants (net) 201.0 6.283582e+00 1.232919e+05 -653249.0 -10047.0 -852.0 9741.0 9.548060e+05
In [12]:
missing_data(df_pop)
Out[12]:
Country (or dependency) Population (2020) Yearly Change Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban Pop % World Share
Total 0 0 0 0 0 0 34 0 0 0 0
Percent 0 0 0 0 0 0 14.4681 0 0 0 0
Types object int64 object int64 int64 int64 float64 object object object object
In [13]:
df_oms.head()
Out[13]:
Country Year Total density per 100 000 population: Health posts Total density per 100 000 population: Health centres Total density per 100 000 population: District/rural hospitals Total density per 100 000 population: Provincial hospitals Total density per 100 000 population: Specialized hospitals Total density per 100 000 population: Hospitals
0 Afghanistan 2013 2.95 1.22 0.18 0.10 0.09 0.37
1 Afghanistan 2010 2.87 1.19 0.18 0.10 0.08 NaN
2 Albania 2013 13.08 NaN 0.72 0.35 0.28 1.36
3 Albania 2010 12.95 NaN 0.72 0.34 0.28 NaN
4 Angola 2010 NaN NaN NaN 0.00 0.00 NaN
In [14]:
df_oms = df_oms.groupby('Country').first().reset_index()
df_oms.shape
Out[14]:
(144, 8)
In [15]:
df_oms.describe().T
Out[15]:
count mean std min 25% 50% 75% max
Year 144.0 2012.951389 0.447680 2010.0 2013.0000 2013.000 2013.000 2014.00
Total density per 100 000 population: Health posts 111.0 22.173243 38.409266 0.0 3.8250 11.370 21.710 272.64
Total density per 100 000 population: Health centres 112.0 6.956964 17.674769 0.0 0.4950 1.775 5.515 140.69
Total density per 100 000 population: District/rural hospitals 125.0 1.150640 2.508633 0.0 0.2500 0.500 1.300 25.64
Total density per 100 000 population: Provincial hospitals 128.0 0.781094 1.547138 0.0 0.0875 0.305 0.705 10.57
Total density per 100 000 population: Specialized hospitals 128.0 0.487656 2.723176 0.0 0.0400 0.115 0.350 30.81
Total density per 100 000 population: Hospitals 137.0 2.235401 5.059961 0.0 0.5500 1.110 2.250 56.45
In [16]:
missing_data(df_oms)
Out[16]:
Country Year Total density per 100 000 population: Health posts Total density per 100 000 population: Health centres Total density per 100 000 population: District/rural hospitals Total density per 100 000 population: Provincial hospitals Total density per 100 000 population: Specialized hospitals Total density per 100 000 population: Hospitals
Total 0 0 33 32 19 16 16 7
Percent 0 0 22.9167 22.2222 13.1944 11.1111 11.1111 4.86111
Types object int64 float64 float64 float64 float64 float64 float64
In [17]:
df_oms.fillna(0, inplace = True)
missing_data(df_oms)
Out[17]:
Country Year Total density per 100 000 population: Health posts Total density per 100 000 population: Health centres Total density per 100 000 population: District/rural hospitals Total density per 100 000 population: Provincial hospitals Total density per 100 000 population: Specialized hospitals Total density per 100 000 population: Hospitals
Total 0 0 0 0 0 0 0 0
Percent 0 0 0 0 0 0 0 0
Types object int64 float64 float64 float64 float64 float64 float64

Evolución del virus

Cálculo de casos activos con base en los casos confirmados, muertes y recuperaciones

In [18]:
df['Active']=df['Confirmed']-df['Deaths']-df['Recovered']

Evolución de los casos posibles y los tres posibles estados del virus.

In [19]:
df_grouped = df.groupby('Date')['Date', 'Deaths', 'Recovered','Active','Confirmed'].sum().reset_index()
fig = make_subplots(rows=2, cols=2)
fig.add_trace(go.Scatter(x=df_grouped['Date'], y=df_grouped['Deaths'],
                name="Deaths",  line_color='red',
                opacity=0.8), row=2, col=2)

fig.add_trace(go.Scatter(x=df_grouped['Date'], y=df_grouped['Recovered'],
                name="Recovered", line_color='gray',
                opacity=0.8), row=1, col=2)

fig.add_trace(go.Scatter( x=df_grouped['Date'], y=df_grouped['Active'],
                name="Active", line_color='black',
                opacity=0.8), row=2, col=1)

fig.add_trace(go.Scatter( x=df_grouped['Date'], y=df_grouped['Confirmed'],
                name="Confirmed", line_color='blue',
                opacity=0.8), row=1, col=1)

fig.update_layout(height=600, width=800, title_text="Coronavirus Evolution")
fig.show()

Se observa que todas las variables han mostrado un crecimiento exponencial. En particular los casos activos tuvieron una reducción en la mitad del periodo de estudio que está asociado a la recuperación de casos en China.

Evolución casos confirmados top 10

In [20]:
b = df[['Country/Region','Confirmed','Date']].groupby(['Country/Region']).sum().reset_index()
b.sort_values(by='Confirmed', ascending=False, inplace=True)
df_grouped_top = df[df['Country/Region'].isin(b['Country/Region'].head(10))].groupby(['Date', 'Country/Region'])['Date','Country/Region', 'Deaths', 'Recovered','Active','Confirmed'].sum().reset_index()

fig = px.line(df_grouped_top, x="Date", y="Confirmed", color='Country/Region',
              title='Evolución casos confirmados para top 10 países')
fig.show()

Casos confirmados con transformación lograrítimica.

In [21]:
df_grouped_top['Confirmed_log']=df_grouped_top['Confirmed'].apply(lambda x:np.log(x))
fig = px.line(df_grouped_top, x="Date", y="Confirmed_log", color='Country/Region',
             title='Evolución casos confirmados para top 10 países (log)')
fig.show()

Se observa como China es el único país en fase de recuperación, en el que los contagios ya no crecen de manera exponencial, a diferencia de los demás países en los que crece de forma vetiginosa los contagios cada día.

Evolución casos confirmados en países más críticos

In [22]:
b.sort_values(by='Confirmed', ascending=False, inplace=True)
b.head(10)
Out[22]:
Country/Region Confirmed
33 China 3531169.0
86 Italy 497959.0
82 Iran 252770.0
155 Spain 186200.0
154 South Korea 181699.0
64 Germany 160974.0
172 US 156669.0
59 France 118509.0
160 Switzerland 46413.0
176 United Kingdom 35365.0
In [23]:
df['Confirmed_log']=df['Confirmed'].apply(lambda x:np.log(x))
china_grouped = df[df['Country/Region'] == 'China'].reset_index()
china_grouped_ = china_grouped.groupby('Date')['Date', 'Deaths', 'Recovered','Active','Confirmed','Confirmed_log'].sum().reset_index()
italy_grouped = df[df['Country/Region'] == 'Italy'].reset_index()
italy_grouped_ = italy_grouped.groupby('Date')['Date', 'Deaths', 'Recovered','Active','Confirmed','Confirmed_log'].sum().reset_index()
iran_grouped = df[df['Country/Region'] == 'Iran'].reset_index()
iran_grouped_ = iran_grouped.groupby('Date')['Date', 'Deaths', 'Recovered','Active','Confirmed','Confirmed_log'].sum().reset_index()
spain_grouped = df[df['Country/Region'] == 'Spain'].reset_index()
spain_grouped_ = spain_grouped.groupby('Date')['Date', 'Deaths', 'Recovered','Active','Confirmed','Confirmed_log'].sum().reset_index()
us_grouped = df[df['Country/Region'] == 'US'].reset_index()
us_grouped_ = us_grouped.groupby('Date')['Date', 'Deaths', 'Recovered','Active','Confirmed','Confirmed_log'].sum().reset_index()
In [24]:
fig = px.line(china_grouped_, x="Date", y="Confirmed_log", title='China')
fig.show()
In [25]:
fig = px.line(italy_grouped_, x="Date", y="Confirmed_log", title='Italy')
fig.show()
In [26]:
fig = px.line(iran_grouped_, x="Date", y="Confirmed_log", title='Iran')
fig.show()
In [27]:
fig = px.line(spain_grouped_, x="Date", y="Confirmed_log", title='Spain')
fig.show()
In [28]:
fig = px.line(us_grouped_, x="Date", y="Confirmed_log", title='US')
fig.show()
In [29]:
df_temp = df[[col for col in df.columns if col != 'Province/State']]

recent = df_temp[df_temp['Date'] == max(df_temp['Date'])].reset_index()
recent_grouped = recent.groupby('Country/Region')['Confirmed', 'Deaths', 'Active', 'Recovered'].sum().reset_index()

Así se ven los países teniendo en cuenta la saturación de casos confirmados:

In [30]:
fig = px.choropleth(recent_grouped, locations="Country/Region", 
                    locationmode='country names', color="Confirmed", 
                    hover_name="Country/Region", range_color=[1,5000], 
                    color_continuous_scale="peach", 
                    title='Países con casos confirmados')

fig.show()

En el caso particular de Europa, así es la situación:

In [31]:
fig = px.choropleth(recent_grouped, locations="Country/Region", 
                    locationmode='country names', color="Confirmed", 
                    hover_name="Country/Region", range_color=[1,5000], 
                    color_continuous_scale="peach", scope='europe',
                    title='Países con casos confirmados Europa')
# fig.update(layout_coloraxis_showscale=False)
fig.show()
In [32]:
fig = px.bar(recent_grouped.sort_values('Confirmed', ascending=False)[:20][::-1], 
             x='Confirmed', y='Country/Region',
             title='Casos confirmados', text='Confirmed', height=1000, orientation='h')
fig.show()

Si bien es cierto, los países más afectados son China e Italia en cuanto a contagios (hasta la fecha). En términos de tasa de muertes y tasa de recuperación los países con peores patrones son los siguientes:

In [33]:
cleaned_recent = df[df['Date'] == max(df['Date'])]
flg = cleaned_recent.groupby('Country/Region')['Confirmed', 'Deaths', 'Recovered', 'Active'].sum().reset_index()

flg['mortalityRate'] = round((flg['Deaths']/flg['Confirmed'])*100, 2)
temp = flg[flg['Confirmed']>100]
temp = temp.sort_values('mortalityRate', ascending=False)

fig = px.bar(temp.sort_values(by="mortalityRate", ascending=False)[:10][::-1],
             x = 'mortalityRate', y = 'Country/Region', 
             title='Muertes por cada 100 casos confirmados', text='mortalityRate', height=800, orientation='h',
             color_discrete_sequence=['darkred']
            )
fig.show()
In [34]:
flg['recoveryRate'] = round((flg['Recovered']/flg['Confirmed'])*100, 2)
df_temp = flg[flg['Confirmed']>100]
df_temp = df_temp.sort_values('recoveryRate', ascending=False)

fig = px.bar(df_temp.sort_values(by="recoveryRate", ascending=False)[:10][::-1],
             x = 'recoveryRate', y = 'Country/Region', 
             title='Recuperaciones por cada 100 casos confirmados', text='recoveryRate', height=800, orientation='h',
             color_discrete_sequence=['#2ca02c']
            )
fig.show()
In [35]:
print("Países con peores tasas de recuperación")
df_temp = flg[flg['Confirmed']>100]
df_temp = df_temp.sort_values('recoveryRate', ascending=True)[['Country/Region', 'Confirmed','Recovered']][:20]
df_temp.sort_values('Confirmed', ascending=False)[['Country/Region', 'Confirmed','Recovered']][:20]
Países con peores tasas de recuperación
Out[35]:
Country/Region Confirmed Recovered
172 US 33276.0 178.0
119 Netherlands 4216.0 2.0
9 Austria 3244.0 9.0
125 Norway 2383.0 1.0
134 Portugal 1600.0 5.0
21 Brazil 1593.0 2.0
44 Denmark 1514.0 1.0
171 Turkey 1236.0 0.0
43 Czechia 1120.0 6.0
84 Ireland 906.0 5.0
49 Ecuador 789.0 3.0
133 Poland 634.0 1.0
151 Slovenia 414.0 0.0
131 Peru 363.0 1.0
54 Estonia 326.0 2.0
153 South Africa 274.0 0.0
128 Panama 245.0 0.0
147 Serbia 222.0 1.0
47 Dominican Republic 202.0 0.0
177 Uruguay 135.0 0.0

Estados Unidos, llama la atención porque tiene la peor tasa de recuperación. Así es la distribución de casos confirmados entre estados en este país:

In [36]:
df_usa = df[df['Country/Region'] == "US"]
usa_recent = df_usa[df_usa['Date'] == max(df_usa['Date'])]
usa_recent = usa_recent.groupby('Province/State')['Confirmed', 'Deaths'].max().reset_index()

fig = px.bar(usa_recent.sort_values('Confirmed', ascending=False)[:10][::-1], 
             x='Confirmed', y='Province/State', color_discrete_sequence=['#D63230'],
             title='Confirmed Cases in USA', text='Confirmed', orientation='h')
fig.show()

Ahora bien, vale la pena hacer énfasis en Colombia, país que presentó el primer caso de contagio el día 6 de marzo y de allí en adelante el aumento de casos tuvo un alto crecimiento llegando a 230 dos semanas después. Al día 23 de marzo (último día con datos recientes) solo se había presentado dos muertes en el país debido al coronavirus, lo cual indica una tasa de fatalidad baja. Sin embargo, la tasa de recuperación se mantiene igualmente baja, manteniéndose una alta cantidad de casos activos lo cual hace más propensa la propagación exponencial del virus.

In [37]:
co_grouped = df[df['Country/Region'] == 'Colombia'].reset_index()
co_grouped_ = co_grouped.groupby('Date')['Date', 'Deaths', 'Recovered','Active','Confirmed','Confirmed_log'].sum().reset_index()
fig = px.line(co_grouped_, x="Date", y="Confirmed", title='Colombia')
fig.show()
In [38]:
fig = px.line(co_grouped_, x="Date", y="Confirmed_log", title='Colombia')
fig.show()

Finalmente, la propagación del contagio en el planeta a lo largo de los días de estudio se puede visualizar a continuación:

In [39]:
formated_gdf = df.groupby(['Date', 'Country/Region'])['Confirmed', 'Deaths'].max()
formated_gdf = formated_gdf.reset_index()
formated_gdf['Date'] = pd.to_datetime(formated_gdf['Date'])
formated_gdf['Date'] = formated_gdf['Date'].dt.strftime('%m/%d/%Y')
formated_gdf['size'] = formated_gdf['Confirmed'].pow(0.3)

fig = px.scatter_geo(formated_gdf, locations="Country/Region", locationmode='country names', 
                     color="Confirmed", size='size', hover_name="Country/Region", 
                     range_color= [0, 1500], 
                     projection="natural earth", animation_frame="Date", 
                     title='Coronavirus: propagación en el tiempo', color_continuous_scale="portland")
# fig.update(layout_coloraxis_showscale=False)
fig.show()

Predicción de contagio

A continuación se utiliza la información obtenida para analizar las variables que mejor explican la propagación del virus.

In [40]:
df_model = df.merge(df_pop, left_on='Country/Region', right_on='Country (or dependency)')
print(df_model.shape)
df_model = df_model.merge(df_oms, left_on='Country/Region', right_on = 'Country') 
print(df_model.shape)
df_model.head()
(14508, 21)
(9672, 29)
Out[40]:
Province/State Country/Region Lat Long Date Confirmed Deaths Recovered Active Confirmed_log ... Urban Pop % World Share Country Year Total density per 100 000 population: Health posts Total density per 100 000 population: Health centres Total density per 100 000 population: District/rural hospitals Total density per 100 000 population: Provincial hospitals Total density per 100 000 population: Specialized hospitals Total density per 100 000 population: Hospitals
0 Thailand Thailand 15.0 101.0 2020-01-22 2.0 0.0 0.0 2.0 0.693147 ... 51% 0.90% Thailand 2013 42.29 0.0 1.1 0.62 0.13 1.84
1 Thailand Thailand 15.0 101.0 2020-01-23 3.0 0.0 0.0 3.0 1.098612 ... 51% 0.90% Thailand 2013 42.29 0.0 1.1 0.62 0.13 1.84
2 Thailand Thailand 15.0 101.0 2020-01-24 5.0 0.0 0.0 5.0 1.609438 ... 51% 0.90% Thailand 2013 42.29 0.0 1.1 0.62 0.13 1.84
3 Thailand Thailand 15.0 101.0 2020-01-25 7.0 0.0 0.0 7.0 1.945910 ... 51% 0.90% Thailand 2013 42.29 0.0 1.1 0.62 0.13 1.84
4 Thailand Thailand 15.0 101.0 2020-01-26 8.0 0.0 2.0 6.0 2.079442 ... 51% 0.90% Thailand 2013 42.29 0.0 1.1 0.62 0.13 1.84

5 rows × 29 columns

Se calculan variables asociadas al tiempo como día del mes, mes, y día del año equivalente.

In [41]:
df_model['Day'] = df_model['Date'].apply(lambda x:x.day)
df_model['Month'] = df_model['Date'].apply(lambda x:x.month)
df_model['Day_year'] = df_model['Date'].apply(lambda x:x.strftime('%j'))
df_model['Day_year'] = df_model['Day_year'].astype(int)

La variable a explicar es el logaritmo de los casos confirmado del virus.

In [42]:
df_model['Confirmed_log']=df_model['Confirmed'].apply(lambda x: np.log(1+x))
df_model.replace([np.inf, -np.inf], 0, inplace=True)
df_model['Confirmed_log'].describe()
Out[42]:
count    9672.000000
mean        1.716757
std         2.417708
min         0.000000
25%         0.000000
50%         0.000000
75%         3.465736
max        11.124332
Name: Confirmed_log, dtype: float64
In [43]:
df_model.columns
Out[43]:
Index(['Province/State', 'Country/Region', 'Lat', 'Long', 'Date', 'Confirmed',
       'Deaths', 'Recovered', 'Active', 'Confirmed_log',
       'Country (or dependency)', 'Population (2020)', 'Yearly Change',
       'Net Change', 'Density (P/Km²)', 'Land Area (Km²)', 'Migrants (net)',
       'Fert. Rate', 'Med. Age', 'Urban Pop %', 'World Share', 'Country',
       'Year', 'Total density per 100 000 population: Health posts',
       'Total density per 100 000 population: Health centres',
       'Total density per 100 000 population: District/rural hospitals',
       'Total density per 100 000 population: Provincial hospitals',
       'Total density per 100 000 population: Specialized hospitals',
       'Total density per 100 000 population: Hospitals', 'Day', 'Month',
       'Day_year'],
      dtype='object')
In [44]:
X = df_model.drop(columns=['Date', 'Confirmed', 'Deaths', 'Recovered', 
                        'Active', 'Country (or dependency)', 'Migrants (net)',  'Confirmed_log',
                          'Yearly Change', 'Country', 'Year', 'World Share', 'Urban Pop %',
                          'Fert. Rate', 'Med. Age'])
y = df_model['Confirmed_log']
X = pd.concat([X,pd.get_dummies(X['Province/State'], prefix='ps')],axis=1)
X.drop(['Province/State'],axis=1, inplace=True)
X = pd.concat([X,pd.get_dummies(X['Country/Region'], prefix='cr')],axis=1)
X.drop(['Country/Region'],axis=1, inplace=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
(7737, 279) (7737,)
(1935, 279) (1935,)
In [45]:
xgb = XGBRegressor(n_estimators=100)
model1 = xgb.fit(X_train, y_train)
[07:56:12] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
In [46]:
cross_val_score(model1, X_train, y_train, cv=10, scoring='neg_mean_squared_error').mean()
[07:56:29] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:56:43] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:57:00] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:57:14] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:57:29] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:57:43] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:57:59] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:58:15] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:58:30] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:58:45] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[46]:
-0.502174834205719
In [47]:
cross_val_score(model1, X_test, y_test, cv=10, scoring='neg_mean_squared_error').mean()
[07:58:59] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:59:03] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:59:07] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:59:12] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:59:16] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:59:20] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:59:23] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:59:27] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:59:31] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
[07:59:35] WARNING: C:/Jenkins/workspace/xgboost-win64_release_0.90/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Out[47]:
-0.5902330508278414
In [48]:
plot = plot_importance(model1, height=0.9, max_num_features=20)
plot
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ccb9a53d68>

Las variables con mayor efecto sobre la propagación claramente es el paso del de los días, adicionalmente el volumen de problación del país y la densidad puesto que hace el contagio más probable. También la latitud y longitug tienen una participación importante del problema, de manera que estar ubicados con menor distancia a los primeros lugares de contagio aumenta la posibilidad de contagio.

In [49]:
predictions = model1.predict(X_test)
print('Mean squared error:',mean_squared_error(y_test, predictions))
pd.DataFrame(predictions,y_test).head(15)
Mean squared error: 0.48167405455521767
Out[49]:
0
Confirmed_log
5.533389 5.658398
0.000000 0.293635
0.000000 -0.076757
0.000000 0.848631
0.000000 -0.145511
0.000000 -0.116850
4.744932 4.209949
6.357842 6.102447
4.882802 3.717540
0.693147 1.700402
0.000000 0.016054
0.000000 -0.231668
1.098612 1.395399
0.000000 0.020059
0.000000 1.044463

Para finalizar, procedo a guardar la base de datos utilizada. Sin embargo, no suelo hacer uso de bases externas de datos, sino siempre privadas por lo que dejo comentado el proceso y la forma en la que guardo y leo bases de datos. Creo que podría hacer una prueba específica de sql porque igual el procesamiento aquí se hizo directamente en python antes de poder guardarlo en una base de datos.

Conexion base de datos snowflake

In [ ]:
snowflake_configuration = {
 'url' : 'url.com',
 'account' : '',
 'user' : '', 
 'port' : ,
 'warehouse': '', 
 'password' : '',
 'database' : '' 
}
In [ ]:
engine = create_engine(URL(**snowflake_configuration))
connection = engine.connect()
In [ ]:
con = snowflake.connector.connect(**snowflake_configuration)
connection.execute(f"USE SCHEMA base_prueba;")
connection.execute(f"USE WAREHOUSE NORMAL_LOAD;")

Guardar base de datos

In [ ]:
table_name='covid_table'
schem='base_prueba'
base.to_sql(table_name, engine, if_exists='replace',schema=schem, index=False, chunksize=16383)
connection.execute(f"grant select on table base_prueba.covid_table to role public;")

Lectura base de datos

In [ ]:
snowflake_data = '''
select country/region, province/state,
case when country/region = 'China' then 'China' else 'Rest of the world' end as China,
sum(confirmed) as confirmados, sum(recovered) as recuperados , sum(active) as activos, 
sum(deaths) as muertes, recuperados/confirmados as ts_recuperacion, muertes/confirmados 
as ts_muertes
from base_prueba.covid_table
group by 1,2,3 order by activos
  '''
In [ ]:
df_snowflake = pd.read_sql(snowflake_data, connection)